Comment data was downloaded from the FCC and processed (code on github). We'll do a little exploratory data analysis on that data (over 5GB) and then output some of it to CSV files for use in creating graphics.

First, let's load the modules we're going to use and set some global variables:


In [7]:
#Modules
import pandas as pd # data frames
import sqlite3 as sql #sqlite for reading the database
import numpy as np # mathematics
from scipy import stats # statistical routines
#import pandasql as pdsql # library to interface pandas and sqlite3 - doesn't work though...

#Globals
FILENAME = '/run/media/potterzot/My Passport/potterzot/data/fcc/nn_comments.db'

The data is stored in a SQLite3 database that we'll use as the basis for the analysis. Initially I tried this using HDF5, but my laptop couldn't handle the memory requirements of so much text, and chunking with pandas wasn't very effective (or perhaps I should say, I wasn't very effective with it).
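
For reference, this is roughly what the chunked approach looks like: pandas' read_sql_query accepts a chunksize argument and yields DataFrames one piece at a time. The chunk size and the processing step here are just placeholders.

import pandas as pd
import sqlite3 as sql

db = sql.connect(FILENAME)
# Pull the comments in manageable pieces instead of loading all 5GB of text at once
for chunk in pd.read_sql_query("SELECT id, text FROM comments", db, chunksize=50000):
    # ... process each chunk here (count words, flag duplicates, etc.)
    pass
db.close()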


In [10]:
# Initialize the connection
db = sql.connect(FILENAME)
cursor = db.cursor()

# How many comments were there?
cursor.execute("SELECT COUNT(*) FROM comments")
cursor.fetchone()


Out[10]:
(446719,)

So we've got almost 450,000 comments. I read somewhere that nearly 150,000 were the same though, so let's see how many unique comments we have.


In [13]:
cursor.execute("SELECT COUNT(*) FROM (SELECT DISTINCT text FROM comments)")
cursor.fetchone()


Out[13]:
(445999,)

Hmm, that doesn't square with what the FCC said, so we should look at it more closely. Let's also get the important data into a dataframe, if possible without breaking the bank.
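
As a starting point for that closer look, a query along these lines (just a sketch, not run here) would surface the most frequently repeated comment texts:

# Top five most-repeated comment texts and how often each appears
q = """SELECT COUNT(*) AS n, substr(text, 1, 80) AS excerpt
       FROM comments
       GROUP BY text
       ORDER BY n DESC
       LIMIT 5"""
cursor.execute(q)
cursor.fetchall()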


In [23]:
# See what fields are available and save them as a data frame
cursor.execute("PRAGMA table_info(comments)")
# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) for each column
metadata = pd.DataFrame(cursor.fetchall(), columns=('cid', 'name', 'type', 'notnull', 'dflt_value', 'pk'))
metadata


Out[23]:
cid name type notnull dflt_value pk
0 0 id TEXT 0 None 0
1 1 applicant TEXT 0 None 0
2 2 applicant_sort TEXT 0 None 0
3 3 author TEXT 0 None 0
4 4 author_sort TEXT 0 None 0
5 5 brief INT 0 None 0
6 6 stateCd TEXT 0 None 0
7 7 city TEXT 0 None 0
8 8 zip TEXT 0 None 0
9 9 daNumber TEXT 0 None 0
10 10 dateCommentPeriod TEXT 0 None 0
11 11 dateReplyComment TEXT 0 None 0
12 12 dateRcpt TEXT 0 None 0
13 13 disseminated TEXT 0 None 0
14 14 exParte INT 0 None 0
15 15 fileNumber TEXT 0 None 0
16 16 lawfirm TEXT 0 None 0
17 17 lawfirm_sort TEXT 0 None 0
18 18 modified TEXT 0 None 0
19 19 pages INT 0 None 0
20 20 proceeding TEXT 0 None 0
21 21 reportNumber TEXT 0 None 0
22 22 regFlexAnalysis INT 0 None 0
23 23 smallBusinessImpact INT 0 None 0
24 24 submissionType TEXT 0 None 0
25 25 text TEXT 0 None 0
26 26 viewingStatus TEXT 0 None 0
27 27 score REAL 0 None 0

Okay, let's create one data frame that holds the geographic and other important fields, but not the text itself, since the text is the major memory consumer.


In [26]:
# Create the query
q = "SELECT id, applicant, author, city, stateCD AS state, zip, pages, score, dateRcpt AS received FROM comments"
cursor.execute(q)
nontextdata = pd.DataFrame(cursor.fetchall(), columns = ('id', 'applicant', 'author', 'city', 'state', 'zip', 'pages', 'score', 'received'))
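
As an aside, pandas can do the execute/fetchall/DataFrame dance in one call, taking the column names straight from the query; this is roughly equivalent to the cell above:

# One-liner equivalent: pandas runs the query and names the columns itself
nontextdata = pd.read_sql_query(q, db)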

In [33]:
# See the number of unique names, cities, states, zips
for v in ['applicant', 'state', 'city', 'zip']:
    print(v+": "+str(len(pd.Series(nontextdata[v]).unique())))


applicant: 355404
state: 28873
city: 60
zip: 27364

So it looks like the city and state fields are swapped in the source data: the state column has 28,873 unique values (presumably cities) and the city column has only 60 (presumably states). A quick look also shows we've got a few extra 'states' thrown into the comments.


In [37]:
nontextdata['city'].unique()


Out[37]:
array(['TN', 'CA', 'SC', 'NY', 'VA', 'MI', 'OR', 'FL', 'PA', 'WI', 'MA',
       'NJ', 'WA', 'IN', 'OK', 'RI', 'NC', 'TX', 'CO', 'CT', 'AZ', 'DE',
       'GA', 'OH', 'MD', 'NM', 'ME', 'NH', 'MO', 'SD', 'KY', 'IL', 'IA',
       'AR', 'LA', 'VT', 'AL', 'KS', 'MN', 'NE', 'WV', 'NV', 'HI', 'AK',
       'ID', 'MS', 'UT', 'MT', 'DC', 'WY', 'ND', 'PR', 'VI', '', 'MP',
       'GU', 'FM', 'PW', 'MH', 'AS'], dtype=object)

Histograms can give a good visual estimate of the distribution among states, though Edward Tufte might argue in favor of a simple table instead.
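
A rough sketch of such a histogram, using matplotlib (not imported above) and remembering that the state codes live in the 'city' column of this extract:

import matplotlib.pyplot as plt

# Comments per state; the state abbreviations are in the 'city' column here
state_counts = nontextdata['city'].value_counts()
state_counts.plot(kind='bar', figsize=(14, 4), title='Comments by state')
plt.show()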


In [50]:
db.close()